Executive Summary

The World Happiness Report data were downloaded from www.kaggle.com. Happiness is scored according to six factors: economic production, social support, life expectancy, freedom, absence of corruption, and generosity.

Description and EDA

The happiness survey asks the Cantril ladder question: participants imagine a ladder on which 10 is the best possible life for them and 0 is the worst, and rate their own current lives on that scale. I’m interested in (1) the overall differences between Western Europe and Middle East and Northern Africa in 2015 (see Table 1 and Table 2), and (2) the relationship between economy and happiness score (see scatterplot).

Table 1: Descriptive Statistics for Observed Variables in Western Europe
Variable                            Mean      SD     Min     Max  % Missing
Dystopia Residual                   2.15    0.38    1.26    2.70          0
Economy (GDP per Capita)            1.30    0.10    1.15    1.56          0
Family                              1.25    0.14    0.89    1.40          0
Freedom                             0.55    0.15    0.08    0.67          0
Generosity                          0.30    0.13    0.00    0.52          0
Happiness Rank                     29.52   29.27    1.00  102.00          0
Happiness Score                     6.69    0.82    4.86    7.59          0
Health (Life Expectancy)            0.91    0.03    0.87    0.96          0
Standard Error                      0.04    0.01    0.02    0.06          0
Trust (Government Corruption)       0.23    0.15    0.01    0.48          0

Table 2: Descriptive Statistics for Observed Variables in Middle East and Northern Africa
Variable                            Mean      SD     Min     Max  % Missing
Dystopia Residual                   1.98    0.54    0.33    3.09          0
Economy (GDP per Capita)            1.07    0.32    0.55    1.69          0
Family                              0.92    0.24    0.47    1.22          0
Freedom                             0.36    0.17    0.00    0.64          0
Generosity                          0.19    0.11    0.06    0.47          0
Happiness Rank                     77.60   43.21   11.00  156.00          0
Happiness Score                     5.41    1.10    3.01    7.28          0
Health (Life Expectancy)            0.71    0.11    0.40    0.91          0
Standard Error                      0.05    0.01    0.03    0.08          0
Trust (Government Corruption)       0.18    0.13    0.05    0.52          0
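
Summary tables of this kind can be produced with a few lines of base R. The sketch below is only an illustration: the data frame `happy`, its `Score` column, and the helper `describe_var()` are hypothetical names, and the values are simulated rather than taken from the 2015 report.

```r
# Hypothetical stand-in for the 2015 data: simulated values, not the real report.
set.seed(1)
happy <- data.frame(
  Region = rep(c("Western Europe", "Middle East and Northern Africa"),
               times = c(21, 20)),
  Score  = c(rnorm(21, mean = 6.7, sd = 0.8), rnorm(20, mean = 5.4, sd = 1.1))
)

# Summary statistics for one variable, mirroring the table columns.
describe_var <- function(x) {
  c(Mean = mean(x, na.rm = TRUE),
    SD = sd(x, na.rm = TRUE),
    Min = min(x, na.rm = TRUE),
    Max = max(x, na.rm = TRUE),
    PctMissing = 100 * mean(is.na(x)))
}

# Split the score by region and summarise each group (one row per region).
stats_by_region <- t(sapply(split(happy$Score, happy$Region), describe_var))
print(round(stats_by_region, 2))
```

Applying `describe_var()` to every numeric column would give one such row per variable, which is the layout used in Tables 1 and 2.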

Hypotheses and Statistical Analysis

Group Comparison

Hypothesis:

H0: Happiness scores do not differ between Western Europe and Middle East and Northern Africa.
H1: Happiness scores differ between Western Europe and Middle East and Northern Africa.

Normality Test

## 
##  Shapiro-Wilk normality test
## 
## data:  `Happiness Score`[Region == "Western Europe"]
## W = 0.89516, p-value = 0.02823
## 
##  Shapiro-Wilk normality test
## 
## data:  `Happiness Score`[Region == "Middle East and Northern Africa"]
## W = 0.97138, p-value = 0.7837

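The Western Europe scores deviate significantly from normality (p = .028), while the Middle East and Northern Africa scores do not (p = .784). Because one group fails the normality check, the group comparison uses the unpaired two-sample Wilcoxon rank-sum test.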
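
The decision rule used here (test each group for normality, then fall back to a rank-based test if either group fails) could be sketched as follows. The data frame `scores` and the variable `test_name` are hypothetical, and the values are simulated, so the outcome will not match the report.

```r
set.seed(2)
# Simulated scores for two groups; these are not the real 2015 values.
scores <- data.frame(
  Region = rep(c("Western Europe", "Middle East and Northern Africa"),
               times = c(21, 20)),
  Score  = c(rnorm(21, mean = 6.7, sd = 0.8), rnorm(20, mean = 5.4, sd = 1.1))
)

# Run shapiro.test() within each region and collect W and the p-value.
normality <- sapply(split(scores$Score, scores$Region), function(x) {
  res <- shapiro.test(x)
  c(W = unname(res$statistic), p.value = res$p.value)
})
print(round(normality, 4))

# Use a t-test only if neither group rejects normality at the 5% level.
use_t_test <- all(normality["p.value", ] > 0.05)
test_name <- if (use_t_test) "t.test" else "wilcox.test"
print(test_name)
```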

Homogeneity in Variances Check

Do the two populations have the same variances?

## 
##  F test to compare two variances
## 
## data:  Happiness Score by Region
## F = 1.7841, num df = 19, denom df = 20, p-value = 0.2077
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.7187759 4.4760931
## sample estimates:
## ratio of variances 
##           1.784056

There is no significant difference between the variances of the two groups (F(19, 20) = 1.78, p = .208).
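
For reference, the F statistic is the ratio of the two sample variances, with n - 1 degrees of freedom per group. The degrees of freedom in the output (19 and 20) imply group sizes of 20 and 21. The sketch below reproduces that setup on simulated data; the vectors `mena` and `we` are hypothetical. Note that the F test is known to be sensitive to departures from normality, so it is best read as a rough check here.

```r
set.seed(3)
# Simulated scores, sized to match the degrees of freedom in the output above.
mena <- rnorm(20, mean = 5.4, sd = 1.1)  # stand-in for Middle East and Northern Africa
we   <- rnorm(21, mean = 6.7, sd = 0.8)  # stand-in for Western Europe

# F test of the null hypothesis that the ratio of the two variances is 1.
f_res <- var.test(mena, we)

# The statistic equals var(mena) / var(we); the parameters are the two df values.
print(f_res$statistic)
print(f_res$parameter)
print(f_res$p.value)
```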

Wilcoxon Test

##   statistic      p.value                                            method
## 1        69 0.0002477983 Wilcoxon rank sum test with continuity correction
##   alternative
## 1   two.sided

Happiness scores in Western Europe differ significantly from those in Middle East and Northern Africa (W = 69, p < .001); the mean score is higher in Western Europe (6.69 vs. 5.41).
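
The W statistic has a direct interpretation: it counts the cross-group pairs in which the first sample's value exceeds the second's (ties count one half). With group sizes of 20 and 21 there are 420 such pairs, so a value far from 210 points to a shift between the groups. The sketch below checks this on simulated data; `mena`, `we`, and `w_manual` are hypothetical names. Because the simulated values have no ties, `wilcox.test()` computes an exact p-value here rather than the continuity-corrected approximation shown in the output above.

```r
set.seed(4)
# Simulated scores; not the real 2015 values.
mena <- rnorm(20, mean = 5.4, sd = 1.1)
we   <- rnorm(21, mean = 6.7, sd = 0.8)

# Unpaired two-sample Wilcoxon rank-sum test.
w_res <- wilcox.test(mena, we, alternative = "two.sided")
print(w_res$statistic)
print(w_res$p.value)

# Manual count of pairs (one value per group) in which the mena value is larger.
w_manual <- sum(outer(mena, we, ">"))
print(w_manual)
```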

Correlation

Normality Test

## 
##  Shapiro-Wilk normality test
## 
## data:  happy_2015$`Health (Life Expectancy)`
## W = 0.93521, p-value = 1.344e-06

## 
##  Shapiro-Wilk normality test
## 
## data:  happy_2015$`Economy (GDP per Capita)`
## W = 0.96502, p-value = 0.0004967

Neither Health (Life Expectancy) nor Economy (GDP per Capita) is normally distributed (p < .001 for both), so both variables were log-transformed (log_health and log_economy) before the correlation analysis.

Kendall Correlation

## Call:corr.test(x = ., method = "kendall")
## Correlation matrix 
##                 Happiness Score log_health log_economy
## Happiness Score            1.00       0.55        0.59
## log_health                 0.55       1.00        0.65
## log_economy                0.59       0.65        1.00
## Sample Size 
## [1] 158
## Probability values (Entries above the diagonal are adjusted for multiple tests.) 
##                 Happiness Score log_health log_economy
## Happiness Score               0          0           0
## log_health                    0          0           0
## log_economy                   0          0           0
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option
## 
##  Confidence intervals based upon normal theory.  To get bootstrapped values, try cor.ci
##             raw.lower raw.r raw.upper raw.p lower.adj upper.adj
## HppnS-lg_hl      0.44  0.55      0.65     0      0.44      0.65
## HppnS-lg_cn      0.48  0.59      0.69     0      0.46      0.70
## lg_hl-lg_cn      0.55  0.65      0.73     0      0.53      0.75
## 
##  Kendall's rank correlation tau
## 
## data:  cor_data$`Happiness Score` and cor_data$log_health
## z = 10.334, p-value < 2.2e-16
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##       tau 
## 0.5541848
## 
##  Kendall's rank correlation tau
## 
## data:  cor_data$`Happiness Score` and cor_data$log_economy
## z = 11.057, p-value < 2.2e-16
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##      tau 
## 0.592945

The Kendall correlation indicated a moderately strong, significant positive relationship between Happiness Score and log-transformed Health (Life Expectancy) (τ = .55 [95% CI: .44; .65], N = 158, p < .001), and a similar significant positive relationship between Happiness Score and log-transformed Economy (GDP per Capita) (τ = .59 [95% CI: .48; .69], N = 158, p < .001).
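
One property worth keeping in mind: Kendall's tau depends only on the ordering of the values, so a strictly increasing transformation such as the logarithm does not change it. The sketch below illustrates this on simulated data; `health` and `score` are hypothetical variables, not the report's columns.

```r
set.seed(5)
n <- 158
# Simulated positive-valued predictor and a noisy outcome that rises with it.
health <- runif(n, min = 0.1, max = 1.0)
score  <- 3.3 + 3.4 * health + rnorm(n, sd = 0.8)

# Tau on the raw predictor and on its logarithm.
tau_raw <- cor(score, health, method = "kendall")
tau_log <- cor(score, log(health), method = "kendall")
print(c(raw = tau_raw, log = tau_log))

# cor.test() adds the significance test for tau.
kt <- cor.test(score, log(health), method = "kendall")
print(kt$estimate)
print(kt$p.value)
```

If the same holds for the report's data, the log transformation affects the interpretation of the variables but not the rank correlations themselves.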

Linear Regression

A simple linear regression was run to investigate the degree to which Health (Life Expectancy) predicts Happiness Score.

Table 3: Regression Results
                            Dependent variable:
                            Happiness Score
Health (Life Expectancy)    3.356***
                            (0.256)
Constant                    3.261***
                            (0.173)
Observations                158
R2                          0.524
Adjusted R2                 0.521
Residual Std. Error         0.792 (df = 156)
F Statistic                 172.052*** (df = 1; 156)
Note: *p<0.1; **p<0.05; ***p<0.01

A significant regression equation was found (F(1,156) = 172.05, p < .001), with an R^2 of 0.52. The predicted Happiness Score for a country is 3.26 + 3.36 × Health (Life Expectancy), so the predicted score rises by 3.36 points for each one-unit increase in Health (Life Expectancy).
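
As a worked example, the fitted equation can be applied to the regional mean Health values from Tables 1 and 2. The coefficients below are copied from Table 3; the function name `predict_score()` is hypothetical.

```r
# Coefficients as reported in Table 3.
intercept <- 3.261
slope     <- 3.356

# Predicted Happiness Score for a given Health (Life Expectancy) value.
predict_score <- function(health) intercept + slope * health

# Regional mean Health values from Tables 1 and 2.
health_means <- c("Western Europe" = 0.91,
                  "Middle East and Northern Africa" = 0.71)
predicted <- round(predict_score(health_means), 2)
print(predicted)
# Western Europe: 6.31; Middle East and Northern Africa: 5.64
```

Compared with the observed regional means (6.69 and 5.41), the Health-only model predicts a smaller gap between the two regions than the data show.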

Residual Diagnostics Plots

Let’s take a look at some plots of residual diagnostics.

In the residual plot, a solid horizontal line at zero separates positive from negative residuals; on inspection, the residuals are scattered roughly evenly on both sides.

The points in the quantile-quantile plot lie close to a straight line, which supports the linear model's assumption that the residuals are normally distributed.

Normality Test on Residuals

## 
##  Shapiro-Wilk normality test
## 
## data:  resid(reg)
## W = 0.98078, p-value = 0.02677

With p = .027 (< .05), the Shapiro-Wilk test indicates that the residuals are not normally distributed, despite the roughly linear Q-Q plot.
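
The diagnostics described in this section can be reproduced along the following lines. This is a sketch on simulated data: the data frame `dat` and its columns are hypothetical, and only the object name `reg` is taken from the output above. With a sample of this size, the Shapiro-Wilk test can flag deviations that look minor in a Q-Q plot, which may explain why the plot and the test disagree.

```r
set.seed(7)
n <- 158
# Simulated predictor and outcome; not the real 2015 values.
dat <- data.frame(health = runif(n, min = 0.1, max = 1.0))
dat$score <- 3.3 + 3.4 * dat$health + rnorm(n, sd = 0.8)

# Simple linear regression of the outcome on the single predictor.
reg <- lm(score ~ health, data = dat)

# Residuals vs fitted values and the normal Q-Q plot (plots 1 and 2 of plot.lm).
par(mfrow = c(1, 2))
plot(reg, which = 1:2)

# Formal normality test on the residuals.
sw <- shapiro.test(resid(reg))
print(sw)
```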

Conclusion

  1. Description and EDA: (1) Two tables presented descriptive statistics for two regions, Western Europe and Middle East and Northern Africa; the differences between them motivated further analysis. (2) Both scatter plots showed linear relationships, between Happiness Score and Economy (GDP per Capita) and between Happiness Score and Health (Life Expectancy).

  2. Group comparison and correlation analysis: (1) Western Europe’s happiness score is significantly different from Middle East and Northern Africa’s score (p < .001). (2) There are moderately strong, significant positive relationships between Happiness Score and Economy (GDP per Capita), and between Happiness Score and Health (Life Expectancy).

  3. Linear regression: Health (Life Expectancy) is a significant predictor of Happiness Score. The residual diagnostic plots look acceptable, but the Shapiro-Wilk test indicates that the residuals are not normally distributed.